In this data mining project, we aim to analyze the employment and income status of various block groups within Charlotte, NC, specifically focusing on Mecklenburg County. By examining datasets that include features such as household income, population over 16, geography, labor force participation, and unemployment rates, we seek to uncover patterns and trends that define the socio-economic landscape of Charlotte from 2015 to 2019. This analysis will provide valuable insights into how different communities within the city are grouped based on income, labor force participation, and access to amenities like trails.
The findings from this project can have significant implications for both policymakers and residents of Mecklenburg County. By identifying areas with high income and employment rates, as well as those that may need more support, policymakers can direct resources and initiatives more effectively. This can lead to better-targeted social programs and a more informed understanding of the city's growth and challenges. However, it is important to be mindful of the potential for stereotyping communities based on this data and on the period in which it was collected, and to use the insights gained in a way that promotes inclusivity and equity.
Clustering is a data mining technique used to group similar data points together based on certain characteristics. It helps in identifying patterns and structures within the data that may not be immediately apparent. One of the most popular clustering algorithms is K-Means.
In the context of our project, K-Means clustering can be used to group different block groups in Charlotte based on their socio-economic features. For example, we can identify clusters of blocks with similar income levels, employment rates, and access to amenities. This will help us understand how different areas of the city are evolving and what common characteristics they share.
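In scikit-learn this looks roughly like the following; a minimal sketch on made-up 2-D points (not the census data), where KMeans assigns each point to the nearest of k centroids:

```python
import numpy as np
from sklearn.cluster import KMeans

# toy 2-D points forming two obvious groups (illustrative only, not census data)
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # centroid coordinates
```

The same call, with `n_clusters` tuned to the data, is what gets applied to the normalized block-group features later in this notebook.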
For this project, I wanted to use two datasets from Charlotte, NC that display "Census Block Groups for the years 2015-2019 American Community Survey 5-Year Estimates in Mecklenburg County, North Carolina". This data was published in 2022, but it comes from surveys collected from 2015-2019, which makes it somewhat dated for a well-rounded analysis. I will be working with two different datasets to analyze the city based on two different aspects: employment status and household income.
The main idea of the dataset is to extract selected geographic and cartographic information from the U.S. Census Bureau's Master Address File. The dataset holds information about the employment status in different block groups of Mecklenburg County, NC. Looking at the dataset, some attributes are easy to interpret, such as "Not in labor force" or "Population 16 years and over". However, others are not as obvious, such as "LOGRECNO" or "Shape_Area". Thus, I will determine their importance as I move forward analyzing and pre-processing the data.
Features of this dataset:
Likewise, this dataset is also an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File. It holds information about the household income of different block groups of Mecklenburg County, NC. This dataset also has readily identifiable attributes, such as "Total Families" or "Family Poverty Rate". However, others are not as obvious, at least for someone unfamiliar with the topic, such as "TRACTCE10". Features of the dataset:
Working with real-world data from Charlotte, NC, is a new and exciting challenge for me. This is my first time diving into actual town data, and it's a bit different from the clean, perfect datasets we've used in class. Real-world data can be messy, with missing pieces and inconsistencies that need to be sorted out. I'm looking forward to exploring the socio-economic trends in Charlotte and seeing what patterns emerge. It's a great opportunity to learn more about the community and improve my data analysis skills in a practical way.
I will start by analyzing the datasets and their basic statistics to decide what path to take.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from itertools import combinations
# read csv
data_employment = pd.read_csv('data/Census_Employment_Status_Block_Groups.csv')
data_income = pd.read_csv('data/Census_Household_Income_Block_Groups.csv')
# pd.read_csv already returns a DataFrame; keep explicit copies to work on
df_emp = data_employment.copy()
df_inc = data_income.copy()
df_emp.head()
| OBJECTID | STATEFP10 | COUNTYFP10 | TRACTCE10 | BLKGRPCE10 | GEOID10 | NAMELSAD10 | LOGRECNO | geoname | Population_16_years_and_over | In_labor_force | Civilian_labor_force | Civilian_labor_force_employed | Civilian_labor_force_unemployed | Armed_forces | Not_in_labor_force | Unemployment_Rate | GlobalID | Shape__Area | Shape__Length | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3331 | 37 | 119 | 1920 | 3 | 371190019203 | Block Group 3 | 16225 | Block Group 3, Census Tract 19.20, Mecklenburg... | 691 | 499 | 499 | 470 | 29 | 0 | 192 | 0.058116 | ad803924-2839-4716-af48-7f5b22ed00f1 | 5.033422e+06 | 10882.286117 |
| 1 | 3332 | 37 | 119 | 5846 | 2 | 371190058462 | Block Group 2 | 16578 | Block Group 2, Census Tract 58.46, Mecklenburg... | 1781 | 1104 | 1104 | 1034 | 70 | 0 | 677 | 0.063406 | 429648b4-3928-465e-862f-9d6d74710b17 | 2.742804e+07 | 25848.887577 |
| 2 | 3333 | 37 | 119 | 5510 | 1 | 371190055101 | Block Group 1 | 16423 | Block Group 1, Census Tract 55.10, Mecklenburg... | 1368 | 988 | 988 | 937 | 51 | 0 | 380 | 0.051619 | f5b337c0-03bc-4e9b-ba25-23e28ed28f3e | 4.013663e+07 | 27127.427970 |
| 3 | 3334 | 37 | 119 | 6404 | 1 | 371190064041 | Block Group 1 | 16674 | Block Group 1, Census Tract 64.04, Mecklenburg... | 1412 | 846 | 846 | 839 | 7 | 0 | 566 | 0.008274 | 944da35c-5241-4c86-a035-83de739cbdfb | 1.214716e+08 | 51184.172136 |
| 4 | 3335 | 37 | 119 | 6405 | 1 | 371190064051 | Block Group 1 | 16676 | Block Group 1, Census Tract 64.05, Mecklenburg... | 2385 | 1649 | 1649 | 1632 | 17 | 0 | 736 | 0.010309 | 26cd14b2-6cce-4e6c-8fcc-130bc5768d3b | 2.954331e+07 | 25281.772955 |
df_inc.head()
| OBJECTID | STATEFP10 | COUNTYFP10 | TRACTCE10 | BLKGRPCE10 | GEOID10 | NAMELSAD10 | LOGRECNO | geoname | Total_Households | ... | i125000_to_149999 | i150000_to_199999 | i200000_or_more | Median_Household_Income | TotalFamilies | FamiliesInPoverty | FamilyPovertyRate | GlobalID | Shape__Area | Shape__Length | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3331 | 37 | 119 | 1609 | 2 | 371190016092 | Block Group 2 | 16185 | Block Group 2, Census Tract 16.09, Mecklenburg... | 995 | ... | 32 | 10 | 0 | 45963 | 558 | 149 | 0.267025 | 0604ad89-6954-4390-8e68-9f8f6992e321 | 1.447086e+07 | 18153.991511 |
| 1 | 3332 | 37 | 119 | 3109 | 1 | 371190031091 | Block Group 1 | 16319 | Block Group 1, Census Tract 31.09, Mecklenburg... | 1101 | ... | 0 | 37 | 0 | 43609 | 513 | 107 | 0.208577 | 9378d28f-3531-4e16-b807-5e081a04ef38 | 5.838492e+06 | 11934.206452 |
| 2 | 3333 | 37 | 119 | 6405 | 2 | 371190064052 | Block Group 2 | 16677 | Block Group 2, Census Tract 64.05, Mecklenburg... | 1101 | ... | 30 | 147 | 147 | 85258 | 622 | 14 | 0.022508 | 330fa39f-581e-4c55-8a34-164cfbe9e428 | 9.416605e+07 | 46805.884093 |
| 3 | 3334 | 37 | 119 | 2702 | 1 | 371190027021 | Block Group 1 | 16260 | Block Group 1, Census Tract 27.02, Mecklenburg... | 842 | ... | 103 | 70 | 368 | 180175 | 550 | 14 | 0.025455 | 862bc825-2b8d-4a61-941f-0e6031cad9a7 | 1.737750e+07 | 18116.211387 |
| 4 | 3335 | 37 | 119 | 3013 | 1 | 371190030131 | Block Group 1 | 16296 | Block Group 1, Census Tract 30.13, Mecklenburg... | 1065 | ... | 81 | 152 | 522 | 194514 | 849 | 14 | 0.016490 | 27ee8e5c-14df-419b-be4e-c53327dd2454 | 3.709292e+07 | 36604.354128 |
5 rows × 33 columns
# shape of dataframes
print(f'Employment dataset shape: {df_emp.shape}, Household Income dataset shape: {df_inc.shape}')
Employment dataset shape: (555, 20), Household Income dataset shape: (555, 33)
# data types of dataframes
print(f'df_emp:{df_emp.dtypes} df_inc:{df_inc.dtypes}')
df_emp:
OBJECTID int64
STATEFP10 int64
COUNTYFP10 int64
TRACTCE10 int64
BLKGRPCE10 int64
GEOID10 int64
NAMELSAD10 object
LOGRECNO int64
geoname object
Population_16_years_and_over int64
In_labor_force int64
Civilian_labor_force int64
Civilian_labor_force_employed int64
Civilian_labor_force_unemployed int64
Armed_forces int64
Not_in_labor_force int64
Unemployment_Rate float64
GlobalID object
Shape__Area float64
Shape__Length float64
dtype: object
df_inc:
OBJECTID int64
STATEFP10 int64
COUNTYFP10 int64
TRACTCE10 int64
BLKGRPCE10 int64
GEOID10 int64
NAMELSAD10 object
LOGRECNO int64
geoname object
Total_Households int64
Less_than_10000 int64
i10000_to_14999 int64
i15000_to19999 int64
i20000_to_24999 int64
i25000_to_29999 int64
i30000_to_34999 int64
i35000_to_39999 int64
i40000_to_44999 int64
i45000_to_49999 int64
i50000_to_59999 int64
i60000_to_74999 int64
i75000_to_99999 int64
i100000_to_124999 int64
i125000_to_149999 int64
i150000_to_199999 int64
i200000_or_more int64
Median_Household_Income int64
TotalFamilies int64
FamiliesInPoverty int64
FamilyPovertyRate float64
GlobalID object
Shape__Area float64
Shape__Length float64
dtype: object
Most of the features are already numeric, so the data will be easy to manipulate. It seems like the datasets cover the same block groups in Charlotte. However, when trying to merge them, they only had 5 rows in common. Thus, we will be working on them separately.
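Since the row overlap was the deciding factor for keeping the datasets separate, here is a small sketch of how that overlap could be quantified with an inner merge. The GEOID10 values below are toy stand-ins, and GEOID10 is assumed to be the full block-group key present in both files:

```python
import pandas as pd

# toy stand-ins for the two census frames (illustrative block-group keys only)
df_emp_demo = pd.DataFrame({'GEOID10': [371190019203, 371190058462, 371190055101]})
df_inc_demo = pd.DataFrame({'GEOID10': [371190058462, 371190055101, 371190016092]})

# an inner merge on the full block-group identifier counts the shared rows
shared = df_emp_demo.merge(df_inc_demo, on='GEOID10', how='inner')
print(f'block groups present in both files: {len(shared)}')
```

Applied to the real frames, the same pattern would reveal whether the low overlap comes from the keys themselves or from merging on too many columns at once.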
# get unique values of the dataframes
for col in df_emp.columns:
    print(f'unique items: {df_emp[col].nunique()} feature: {col}')
print('__________')
for col in df_inc.columns:
    print(f'unique items: {df_inc[col].nunique()} feature: {col}')
unique items: 555 feature: OBJECTID
unique items: 1 feature: STATEFP10
unique items: 1 feature: COUNTYFP10
unique items: 233 feature: TRACTCE10
unique items: 6 feature: BLKGRPCE10
unique items: 555 feature: GEOID10
unique items: 6 feature: NAMELSAD10
unique items: 555 feature: LOGRECNO
unique items: 555 feature: geoname
unique items: 490 feature: Population_16_years_and_over
unique items: 475 feature: In_labor_force
unique items: 473 feature: Civilian_labor_force
unique items: 471 feature: Civilian_labor_force_employed
unique items: 148 feature: Civilian_labor_force_unemployed
unique items: 21 feature: Armed_forces
unique items: 407 feature: Not_in_labor_force
unique items: 475 feature: Unemployment_Rate
unique items: 555 feature: GlobalID
unique items: 555 feature: Shape__Area
unique items: 555 feature: Shape__Length
__________
unique items: 555 feature: OBJECTID
unique items: 1 feature: STATEFP10
unique items: 1 feature: COUNTYFP10
unique items: 233 feature: TRACTCE10
unique items: 6 feature: BLKGRPCE10
unique items: 555 feature: GEOID10
unique items: 6 feature: NAMELSAD10
unique items: 555 feature: LOGRECNO
unique items: 555 feature: geoname
unique items: 445 feature: Total_Households
unique items: 126 feature: Less_than_10000
unique items: 95 feature: i10000_to_14999
unique items: 108 feature: i15000_to19999
unique items: 106 feature: i20000_to_24999
unique items: 112 feature: i25000_to_29999
unique items: 125 feature: i30000_to_34999
unique items: 113 feature: i35000_to_39999
unique items: 117 feature: i40000_to_44999
unique items: 105 feature: i45000_to_49999
unique items: 157 feature: i50000_to_59999
unique items: 170 feature: i60000_to_74999
unique items: 207 feature: i75000_to_99999
unique items: 175 feature: i100000_to_124999
unique items: 141 feature: i125000_to_149999
unique items: 160 feature: i150000_to_199999
unique items: 194 feature: i200000_or_more
unique items: 542 feature: Median_Household_Income
unique items: 421 feature: TotalFamilies
unique items: 134 feature: FamiliesInPoverty
unique items: 423 feature: FamilyPovertyRate
unique items: 555 feature: GlobalID
unique items: 555 feature: Shape__Area
unique items: 555 feature: Shape__Length
After analyzing the data, I believe I will drop the following features: OBJECTID, STATEFP10, COUNTYFP10, GEOID10, NAMELSAD10, LOGRECNO, geoname, and GlobalID. These are just unique identifiers of each area. Although they matter for the bigger picture, I will not need them here: since we are working within a single county, they carry no distinguishing information for the analysis.
# drop irrelevant columns
df_emp = df_emp.drop(['OBJECTID','STATEFP10', 'COUNTYFP10', 'GEOID10', 'NAMELSAD10', 'LOGRECNO', 'geoname', 'GlobalID'], axis=1)
df_emp
| TRACTCE10 | BLKGRPCE10 | Population_16_years_and_over | In_labor_force | Civilian_labor_force | Civilian_labor_force_employed | Civilian_labor_force_unemployed | Armed_forces | Not_in_labor_force | Unemployment_Rate | Shape__Area | Shape__Length | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1920 | 3 | 691 | 499 | 499 | 470 | 29 | 0 | 192 | 0.058116 | 5.033422e+06 | 10882.286117 |
| 1 | 5846 | 2 | 1781 | 1104 | 1104 | 1034 | 70 | 0 | 677 | 0.063406 | 2.742804e+07 | 25848.887577 |
| 2 | 5510 | 1 | 1368 | 988 | 988 | 937 | 51 | 0 | 380 | 0.051619 | 4.013663e+07 | 27127.427970 |
| 3 | 6404 | 1 | 1412 | 846 | 846 | 839 | 7 | 0 | 566 | 0.008274 | 1.214716e+08 | 51184.172136 |
| 4 | 6405 | 1 | 2385 | 1649 | 1649 | 1632 | 17 | 0 | 736 | 0.010309 | 2.954331e+07 | 25281.772955 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 550 | 1702 | 1 | 958 | 699 | 699 | 661 | 38 | 0 | 259 | 0.054363 | 1.897104e+07 | 19063.699775 |
| 551 | 5621 | 1 | 1898 | 1387 | 1372 | 1352 | 20 | 15 | 511 | 0.014420 | 4.949538e+07 | 31190.061156 |
| 552 | 3008 | 3 | 1656 | 951 | 951 | 951 | 0 | 0 | 705 | 0.000000 | 2.418330e+07 | 21947.985177 |
| 553 | 5511 | 3 | 1244 | 949 | 949 | 941 | 8 | 0 | 295 | 0.008430 | 5.156997e+06 | 10629.623345 |
| 554 | 5823 | 1 | 1107 | 638 | 638 | 638 | 0 | 0 | 469 | 0.000000 | 4.155662e+07 | 28498.331604 |
555 rows × 12 columns
# drop irrelevant columns
df_inc = df_inc.drop(['OBJECTID','STATEFP10', 'COUNTYFP10', 'GEOID10', 'NAMELSAD10', 'LOGRECNO', 'geoname', 'GlobalID'], axis=1)
df_inc
| TRACTCE10 | BLKGRPCE10 | Total_Households | Less_than_10000 | i10000_to_14999 | i15000_to19999 | i20000_to_24999 | i25000_to_29999 | i30000_to_34999 | i35000_to_39999 | ... | i100000_to_124999 | i125000_to_149999 | i150000_to_199999 | i200000_or_more | Median_Household_Income | TotalFamilies | FamiliesInPoverty | FamilyPovertyRate | Shape__Area | Shape__Length | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1609 | 2 | 995 | 111 | 37 | 46 | 73 | 66 | 38 | 87 | ... | 68 | 32 | 10 | 0 | 45963 | 558 | 149 | 0.267025 | 1.447086e+07 | 18153.991511 |
| 1 | 3109 | 1 | 1101 | 40 | 11 | 44 | 136 | 102 | 125 | 25 | ... | 49 | 0 | 37 | 0 | 43609 | 513 | 107 | 0.208577 | 5.838492e+06 | 11934.206452 |
| 2 | 6405 | 2 | 1101 | 20 | 46 | 49 | 40 | 28 | 57 | 25 | ... | 141 | 30 | 147 | 147 | 85258 | 622 | 14 | 0.022508 | 9.416605e+07 | 46805.884093 |
| 3 | 2702 | 1 | 842 | 18 | 14 | 0 | 25 | 12 | 58 | 0 | ... | 30 | 103 | 70 | 368 | 180175 | 550 | 14 | 0.025455 | 1.737750e+07 | 18116.211387 |
| 4 | 3013 | 1 | 1065 | 37 | 14 | 0 | 0 | 0 | 16 | 0 | ... | 62 | 81 | 152 | 522 | 194514 | 849 | 14 | 0.016490 | 3.709292e+07 | 36604.354128 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 550 | 1801 | 2 | 168 | 13 | 9 | 18 | 16 | 0 | 6 | 6 | ... | 6 | 19 | 0 | 21 | 59750 | 86 | 0 | 0.000000 | 7.258220e+06 | 16132.903715 |
| 551 | 5403 | 2 | 600 | 25 | 74 | 24 | 19 | 8 | 46 | 22 | ... | 61 | 0 | 15 | 0 | 50368 | 324 | 49 | 0.151235 | 2.454919e+07 | 23960.730077 |
| 552 | 5817 | 2 | 1135 | 41 | 9 | 9 | 7 | 0 | 15 | 23 | ... | 90 | 100 | 190 | 295 | 129028 | 936 | 35 | 0.037393 | 2.943564e+07 | 23981.466563 |
| 553 | 5306 | 1 | 243 | 48 | 6 | 5 | 7 | 4 | 43 | 28 | ... | 16 | 0 | 0 | 0 | 35850 | 112 | 6 | 0.053571 | 7.114165e+06 | 16261.887420 |
| 554 | 1923 | 1 | 1051 | 14 | 7 | 33 | 85 | 49 | 76 | 36 | ... | 40 | 34 | 13 | 12 | 49795 | 682 | 70 | 0.102639 | 1.538164e+07 | 20967.114434 |
555 rows × 25 columns
The dataset is pretty clean, so I will just normalize some features and adapt some others, then start with visualizations and plug the datasets into graphs.
# initialize the MinMaxScaler
scaler = MinMaxScaler()
# fit and transform employment data
data_empN = scaler.fit_transform(df_emp)
df_empN = pd.DataFrame(data_empN)
df_empN
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001857 | 0.4 | 0.149859 | 0.136862 | 0.136862 | 0.133333 | 0.081690 | 0.000000 | 0.079668 | 0.151102 | 0.008386 | 0.048437 |
| 1 | 0.005862 | 0.2 | 0.386250 | 0.302798 | 0.302798 | 0.293333 | 0.197183 | 0.000000 | 0.280913 | 0.164855 | 0.068923 | 0.184690 |
| 2 | 0.005519 | 0.0 | 0.296682 | 0.270982 | 0.270982 | 0.265816 | 0.143662 | 0.000000 | 0.157676 | 0.134211 | 0.103276 | 0.196329 |
| 3 | 0.006431 | 0.0 | 0.306224 | 0.232035 | 0.232035 | 0.238014 | 0.019718 | 0.000000 | 0.234855 | 0.021513 | 0.323141 | 0.415336 |
| 4 | 0.006432 | 0.0 | 0.517241 | 0.452276 | 0.452276 | 0.462979 | 0.047887 | 0.000000 | 0.305394 | 0.026804 | 0.074641 | 0.179527 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 550 | 0.001634 | 0.0 | 0.207764 | 0.191717 | 0.191717 | 0.187518 | 0.107042 | 0.000000 | 0.107469 | 0.141345 | 0.046062 | 0.122919 |
| 551 | 0.005633 | 0.0 | 0.411624 | 0.380417 | 0.376303 | 0.383546 | 0.056338 | 0.245902 | 0.212033 | 0.037491 | 0.128575 | 0.233315 |
| 552 | 0.002967 | 0.4 | 0.359141 | 0.260834 | 0.260834 | 0.269787 | 0.000000 | 0.000000 | 0.292531 | 0.000000 | 0.060152 | 0.149177 |
| 553 | 0.005520 | 0.4 | 0.269790 | 0.260285 | 0.260285 | 0.266950 | 0.022535 | 0.000000 | 0.122407 | 0.021918 | 0.008720 | 0.046137 |
| 554 | 0.005839 | 0.0 | 0.240078 | 0.174986 | 0.174986 | 0.180993 | 0.000000 | 0.000000 | 0.194606 | 0.000000 | 0.107115 | 0.208810 |
555 rows × 12 columns
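A side note on the normalized frames above: `scaler.fit_transform` returns a plain NumPy array, which is why the column headers became 0-11. If the feature names matter for later plots, they can be carried over by passing `columns=`; a sketch with toy values standing in for df_emp:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# toy frame standing in for df_emp (illustrative values)
demo = pd.DataFrame({'In_labor_force': [499.0, 1104.0, 988.0],
                     'Unemployment_Rate': [0.058, 0.063, 0.052]})

# passing columns= keeps the feature names after scaling
scaler = MinMaxScaler()
demo_scaled = pd.DataFrame(scaler.fit_transform(demo), columns=demo.columns)
print(demo_scaled)
```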
# fit and transform household income data
data_incN = scaler.fit_transform(df_inc)
df_incN = pd.DataFrame(data_incN)
df_incN
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001539 | 0.2 | 0.473584 | 0.242358 | 0.201087 | 0.230 | 0.410112 | 0.308411 | 0.156379 | 0.432836 | ... | 0.163855 | 0.125984 | 0.021834 | 0.000000 | 0.183852 | 0.346584 | 0.356459 | 0.332582 | 0.033897 | 0.114637 |
| 1 | 0.003070 | 0.0 | 0.524036 | 0.087336 | 0.059783 | 0.220 | 0.764045 | 0.476636 | 0.514403 | 0.124378 | ... | 0.118072 | 0.000000 | 0.080786 | 0.000000 | 0.174436 | 0.318634 | 0.255981 | 0.259785 | 0.010562 | 0.058014 |
| 2 | 0.006432 | 0.2 | 0.524036 | 0.043668 | 0.250000 | 0.245 | 0.224719 | 0.130841 | 0.234568 | 0.124378 | ... | 0.339759 | 0.118110 | 0.320961 | 0.221386 | 0.341032 | 0.386335 | 0.033493 | 0.028034 | 0.249329 | 0.375477 |
| 3 | 0.002655 | 0.0 | 0.400762 | 0.039301 | 0.076087 | 0.000 | 0.140449 | 0.056075 | 0.238683 | 0.000000 | ... | 0.072289 | 0.405512 | 0.152838 | 0.554217 | 0.720700 | 0.341615 | 0.033493 | 0.031704 | 0.041754 | 0.114293 |
| 4 | 0.002972 | 0.0 | 0.506901 | 0.080786 | 0.076087 | 0.000 | 0.000000 | 0.000000 | 0.065844 | 0.000000 | ... | 0.149398 | 0.318898 | 0.331878 | 0.786145 | 0.778056 | 0.527329 | 0.033493 | 0.020538 | 0.095049 | 0.282605 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 550 | 0.001735 | 0.2 | 0.079962 | 0.028384 | 0.048913 | 0.090 | 0.089888 | 0.000000 | 0.024691 | 0.029851 | ... | 0.014458 | 0.074803 | 0.000000 | 0.031627 | 0.239000 | 0.053416 | 0.000000 | 0.000000 | 0.014400 | 0.096238 |
| 551 | 0.005410 | 0.2 | 0.285578 | 0.054585 | 0.402174 | 0.120 | 0.106742 | 0.037383 | 0.189300 | 0.109453 | ... | 0.146988 | 0.000000 | 0.032751 | 0.000000 | 0.201472 | 0.201242 | 0.117225 | 0.188364 | 0.061141 | 0.167500 |
| 552 | 0.005832 | 0.2 | 0.540219 | 0.089520 | 0.048913 | 0.045 | 0.039326 | 0.000000 | 0.061728 | 0.114428 | ... | 0.216867 | 0.393701 | 0.414847 | 0.444277 | 0.516112 | 0.581366 | 0.083732 | 0.046574 | 0.074350 | 0.167689 |
| 553 | 0.005311 | 0.0 | 0.115659 | 0.104803 | 0.032609 | 0.025 | 0.039326 | 0.018692 | 0.176955 | 0.139303 | ... | 0.038554 | 0.000000 | 0.000000 | 0.000000 | 0.143400 | 0.069565 | 0.014354 | 0.066724 | 0.014010 | 0.097412 |
| 554 | 0.001860 | 0.0 | 0.500238 | 0.030568 | 0.038043 | 0.165 | 0.477528 | 0.228972 | 0.312757 | 0.179104 | ... | 0.096386 | 0.133858 | 0.028384 | 0.018072 | 0.199180 | 0.423602 | 0.167464 | 0.127838 | 0.036359 | 0.140247 |
555 rows × 25 columns
I will use PCA to visualize the data and understand where the clusters are being formed. I also want to test how well PCA works with normalized data, and I will build visualizations to see if I can detect any patterns.
from sklearn.decomposition import PCA
sns.pairplot(df_empN)
<seaborn.axisgrid.PairGrid at 0x2200d93ca90>
It seems like there are some relationships in this dataset that could be useful for clustering; anything that does not collapse into a straight line is a candidate for finding patterns.
plt.figure(figsize=(15,8))
plt.title('Employment in Different Blocks')
sns.heatmap(df_emp.corr(numeric_only=True), annot=True)
<Axes: title={'center': 'Employment in Different Blocks'}>
There are plenty of features with some type of correlation with each other. Some of these correlations are ones we care about, and others not so much (such as Shape__Area vs. Shape__Length, which are purely geometric).
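The `combinations` import from the setup cell can be put to use here: a small sketch (with toy numbers, not the census frame) of how highly correlated pairs such as Shape__Area and Shape__Length could be flagged programmatically instead of read off the heatmap:

```python
import pandas as pd
from itertools import combinations

# toy frame: the two shape features move together, the rate does not
demo = pd.DataFrame({'Shape__Area': [1.0, 2.0, 3.0, 4.0],
                     'Shape__Length': [1.1, 2.0, 2.9, 4.2],
                     'Unemployment_Rate': [0.06, 0.01, 0.05, 0.02]})

# list every feature pair whose absolute correlation exceeds a threshold
corr = demo.corr()
for a, b in combinations(corr.columns, 2):
    if abs(corr.loc[a, b]) > 0.9:
        print(f'{a} ~ {b}: {corr.loc[a, b]:.2f}')
```

One feature from each flagged pair could then be dropped before clustering, so near-duplicate geometry features do not dominate the distances.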
pca = PCA(n_components=3)
pca_mdl = pca.fit_transform(df_empN)
pca_df = pd.DataFrame(pca_mdl, columns=['pca1', 'pca2', 'pca3'])
pca_df
| pca1 | pca2 | pca3 | |
|---|---|---|---|
| 0 | -0.393082 | 0.170986 | 0.044172 |
| 1 | 0.067810 | 0.034012 | 0.055509 |
| 2 | -0.022650 | -0.184531 | 0.007167 |
| 3 | -0.031725 | -0.200484 | -0.172885 |
| 4 | 0.338878 | -0.120808 | -0.200699 |
| ... | ... | ... | ... |
| 550 | -0.207076 | -0.207946 | 0.023142 |
| 551 | 0.181165 | -0.150613 | -0.166799 |
| 552 | -0.076219 | 0.220676 | -0.178359 |
| 553 | -0.170404 | 0.209266 | -0.131379 |
| 554 | -0.201701 | -0.208648 | -0.156646 |
555 rows × 3 columns
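To judge how much of the 12 normalized features these three components actually retain, PCA exposes `explained_variance_ratio_`; a sketch on random stand-in data of the same shape (the real ratios would come from the fitted `pca` above):

```python
import numpy as np
from sklearn.decomposition import PCA

# random stand-in for the normalized employment matrix (555 rows x 12 features)
rng = np.random.default_rng(1)
X = rng.normal(size=(555, 12))

pca_demo = PCA(n_components=3).fit(X)
# fraction of total variance each component captures, and their sum
print(pca_demo.explained_variance_ratio_)
print(f'total captured: {pca_demo.explained_variance_ratio_.sum():.2%}')
```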
sns.scatterplot(x=pca_df['pca2'], y=pca_df['pca3'], hue=pca_df['pca2'])
<Axes: xlabel='pca2', ylabel='pca3'>
There are also a lot of visuals in this dataset that can give us an idea of its grouping.
sns.pairplot(df_incN)
<seaborn.axisgrid.PairGrid at 0x220134d70d0>
plt.figure(figsize=(15,8))
plt.title('Household Income in Different Blocks')
sns.heatmap(df_inc.corr(numeric_only=True), annot=True)
<Axes: title={'center': 'Household Income in Different Blocks'}>
After these visuals, we have an idea of where the data leads and of the correlations among features.
from sklearn.cluster import KMeans
inertia = []
for k in range(1, 8):
    # n_init set explicitly to avoid the sklearn FutureWarning about its default
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=1).fit(df_empN)
    inertia.append(np.sqrt(kmeans.inertia_))
plt.plot(range(1, 8), inertia, marker='s');
plt.xlabel('$k$')
plt.ylabel('Variance')
Text(0, 0.5, 'Variance')
Based on the elbow in the graph, I will pick 3 as the value of k for the clusters.
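As a sanity check on the elbow choice, the silhouette score offers a second opinion (values near 1 mean tight, well-separated clusters); a sketch on synthetic blobs rather than df_empN:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic stand-in: three well-separated blobs (not the census data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, size=(40, 2)) for c in (0.0, 3.0, 6.0)])

# average silhouette width for a few candidate k values (higher is better)
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    print(f'k={k}: silhouette={silhouette_score(X, labels):.3f}')
```

Running the same loop over df_empN would show whether k = 3 holds up by this metric as well.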
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(df_empN)
df_emp['Cluster'] = kmeans.labels_
0 2
1 0
2 0
3 0
4 1
..
550 2
551 0
552 0
553 2
554 2
Name: Cluster, Length: 555, dtype: int32
# attach the cluster labels to the PCA frame so they can drive the hue
pca_df['Cluster'] = kmeans.labels_
plt.figure(figsize=(10, 6))
sns.scatterplot(data=pca_df, x='pca1', y='pca3', hue='Cluster', palette='Set1', s=100)
plt.title('PCA Results with KMeans Clustering')
plt.show()
Here, we can see how the clusters are formed in the PCA projection, with the clustering model separating the groups. Therefore, there is a meaningful group structure in the dataset.
sns.pairplot(df_emp, hue='Cluster')
<seaborn.axisgrid.PairGrid at 0x1dd716ffbd0>
The scatter plot looks very similar to some of the pairplots above. Thus, we can get an idea of the features that drove the clustering, such as 'Population_16_years_and_over' vs. 'Unemployment_Rate', or 'Civilian_labor_force_employed' vs. 'Not_in_labor_force'.
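To turn the visual impression into numbers, each cluster can be profiled by its feature means; a sketch with a toy labeled frame standing in for df_emp and its 'Cluster' column:

```python
import pandas as pd

# toy labeled frame standing in for df_emp with its 'Cluster' column
demo = pd.DataFrame({
    'Cluster': [0, 0, 1, 1],
    'Population_16_years_and_over': [691, 1781, 1368, 2385],
    'Unemployment_Rate': [0.058, 0.063, 0.052, 0.010],
})

# per-cluster feature means summarize what each group looks like
profile = demo.groupby('Cluster').mean()
print(profile)
```

On the real frame, `df_emp.groupby('Cluster').mean()` would show, for example, which cluster corresponds to high-unemployment block groups.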
inertia = []
for k in range(1, 8):
    # n_init set explicitly to avoid the sklearn FutureWarning about its default
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=1).fit(df_incN)
    inertia.append(np.sqrt(kmeans.inertia_))
C:\Users\dagon\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
C:\Users\dagon\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=3.
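Both warnings can be avoided by making the defaults explicit, as the messages themselves suggest. A minimal sketch with placeholder data (the environment variable generally has to be set before NumPy/scikit-learn load MKL for it to take effect):

```python
import os
os.environ["OMP_NUM_THREADS"] = "3"  # avoids the Windows/MKL memory-leak warning

import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(50, 4)  # placeholder for the real feature matrix
# Passing n_init explicitly silences the FutureWarning about the 1.4 default change
km = KMeans(n_clusters=3, n_init=10, random_state=43).fit(X)
print(km.labels_.shape)
```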
plt.plot(range(1, 8), inertia, marker='s')
plt.xlabel('$k$')
plt.ylabel('Inertia')
I picked a k of 3 for this clustering as well, since the elbow in the curve suggests it performs the best.
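For reference, the `inertia` list plotted in the elbow curve above could be built as in the sketch below, where random placeholder data stands in for the normalized income features `df_incN`:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(43)
df_incN = rng.rand(100, 5)  # placeholder for the normalized income dataframe

inertia = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=43).fit(df_incN)
    inertia.append(km.inertia_)  # within-cluster sum of squared distances
print(len(inertia))
```

The inertia keeps dropping as k grows, so the "elbow" — the point where the drop levels off — is what guides the choice of k, not the minimum.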
kmeans = KMeans(n_clusters=3, n_init=10, random_state=43).fit(df_incN)
I also ran PCA on this dataset with 3 components to help with visualizing it.
# pca for household income
pca = PCA(n_components=3)
pca_mdl = pca.fit_transform(df_incN)
pca_df = pd.DataFrame(pca_mdl, columns=['pca1', 'pca2', 'pca3'])
df_inc['Cluster'] = kmeans.labels_
pca_df['Cluster'] = kmeans.labels_  # the plot below colors by cluster, so pca_df needs the labels too
plt.figure(figsize=(10, 6))
sns.scatterplot(data=pca_df, x='pca2', y='pca1', hue='Cluster', palette='Set1', s=100)
plt.title('PCA Results with KMeans Clustering')
plt.show()
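Before trusting a 2-D view of a 3-component projection, it can be worth checking how much of the original variance the components actually retain. A sketch with placeholder data standing in for `df_incN`:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(120, 6)  # placeholder for the income features
pca = PCA(n_components=3)
pca.fit(X)
ratios = pca.explained_variance_ratio_  # variance share per component, descending
print(ratios, ratios.sum())
```

If the first two ratios already cover most of the variance, the 2-D scatter above is a faithful picture of the clustering; if not, the plot may hide separation that exists in the dropped dimensions.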
Here, we can also see that the data tends to split into 3 clusters, and the split is fairly even.
cluster_analysis = df_inc.groupby('Cluster').count()
cluster_analysis.T
| Cluster | 0 | 1 | 2 |
|---|---|---|---|
| TRACTCE10 | 263 | 152 | 140 |
| BLKGRPCE10 | 263 | 152 | 140 |
| Total_Households | 263 | 152 | 140 |
| Less_than_10000 | 263 | 152 | 140 |
| i10000_to_14999 | 263 | 152 | 140 |
| i15000_to19999 | 263 | 152 | 140 |
| i20000_to_24999 | 263 | 152 | 140 |
| i25000_to_29999 | 263 | 152 | 140 |
| i30000_to_34999 | 263 | 152 | 140 |
| i35000_to_39999 | 263 | 152 | 140 |
| i40000_to_44999 | 263 | 152 | 140 |
| i45000_to_49999 | 263 | 152 | 140 |
| i50000_to_59999 | 263 | 152 | 140 |
| i60000_to_74999 | 263 | 152 | 140 |
| i75000_to_99999 | 263 | 152 | 140 |
| i100000_to_124999 | 263 | 152 | 140 |
| i125000_to_149999 | 263 | 152 | 140 |
| i150000_to_199999 | 263 | 152 | 140 |
| i200000_or_more | 263 | 152 | 140 |
| Median_Household_Income | 263 | 152 | 140 |
| TotalFamilies | 263 | 152 | 140 |
| FamiliesInPoverty | 263 | 152 | 140 |
| FamilyPovertyRate | 263 | 152 | 140 |
| Shape__Area | 263 | 152 | 140 |
| Shape__Length | 263 | 152 | 140 |
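Since every row of the groupby-count table repeats the same three numbers, the cluster sizes can also be read off directly with `value_counts()`. A sketch using placeholder labels that reproduce the sizes in the table:

```python
import pandas as pd

# Placeholder labels matching the cluster sizes shown above
df_inc = pd.DataFrame({"Cluster": [0] * 263 + [1] * 152 + [2] * 140})
sizes = df_inc["Cluster"].value_counts().sort_index()
print(sizes.tolist())  # [263, 152, 140]
```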
We can see that there are distinct clusters in both the Employment dataset and the Household Income dataset for Charlotte. These clusters are useful for understanding the data in more depth. I chose the k-means algorithm because, after visualizing the datasets, it seemed to be a good fit for them.
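Beyond visual inspection, a silhouette score could add quantitative support for the choice of k=3. A sketch using synthetic 3-cluster data in place of the real Charlotte features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 blobs stands in for the real features
X, _ = make_blobs(n_samples=300, centers=3, random_state=43)
labels = KMeans(n_clusters=3, n_init=10, random_state=43).fit_predict(X)
score = silhouette_score(X, labels)  # ranges from -1 (poor) to +1 (well separated)
print(score > 0)
```

Comparing this score across candidate values of k would give a second line of evidence alongside the elbow plot.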
Many features are relevant to determining which cluster a data point belongs to. As visualized in the different graphs, we can form the following ideas:
With these features, we were able to compare and group different areas or zones in Charlotte and relate them to one another. It was very interesting working with this dataset and seeing the possibilities come into play. I believe the insights were valuable, as was the flow of the dataframes.